
Adding with_master and release_master! helpers. #209

Open · wants to merge 1 commit into master
Conversation

kigster

@kigster commented Jun 13, 2018

This is just a concept, but I think an important one to have...

# Proposed helper (per the PR title): run the block pinned to the primary.
def with_master
  stick_to_master!
  yield if block_given?
ensure
  release_master!
end
Contributor

There's a subtlety here about whether the connection was already stuck to master before you ran this block.
Imagine you were already stuck to master in a request, then called with_master: at the end of the block, I would expect you to recover the previous context (stuck to master).
I imagine this is why an instance variable was used in without_sticking and then leveraged in the various methods.
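
For illustration, a minimal sketch of the nesting-safe variant this comment suggests, assuming a stuck_to_master? predicate is reachable (per the snippet further down it exists only as a private method on the connection, so a real implementation would need to account for that):

def with_master
  was_stuck = stuck_to_master?
  stick_to_master!
  yield if block_given?
ensure
  # Only unstick if this block was the one that stuck the connection.
  release_master! unless was_stuck
end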

Author

Good point, I'll update.

@chrisb

chrisb commented Jul 25, 2018

We have a similar need and ended up with this ActiveRecord::Base helper which scopes master sticking to a block. Feel free to adapt it:

ActiveRecord::Base.class_eval do
  # Runs the block with all queries pinned to the primary, then restores
  # the previous sticking state so already-stuck contexts survive.
  def self.on_master
    previously_stuck = self.connection.send(:stuck_to_master?)
    self.connection.stick_to_master!(true) unless previously_stuck
    yield
  ensure
    # Only release the context if this block was the one that stuck it.
    unless previously_stuck
      connection_id = self.connection.instance_variable_get(:@id)
      Makara::Context.release(connection_id)
    end
  end
end

MyModel.on_master do 
  # do stuff
end

Curious: is there any way to achieve the same thing with sticky: false?

@kigster
Copy link
Author

kigster commented Jul 25, 2018

That’s exactly what I wanted to achieve: to have on_master work with or without the sticky flag. Thanks for sharing your block!

@chrisb

chrisb commented Jul 25, 2018

@kigster yeah, sadly my snippet only works when sticky is true. If you find a way to force master with sticky disabled, please share!

@camol

camol commented Nov 21, 2023

Did you manage to get this done with the latest Makara?

@kigster
Author

kigster commented Nov 21, 2023

> Did you manage to get this done with the latest Makara?

Believe it or not, it's a very timely albeit complex question.


Background

I got involved with Makara while I was the CTO at Wanelo.com and Brian Leonard was the VPE at TaskRabbit.

Our offices were close by, and during one of our lunches he told me about Makara. Despite the fact that TaskRabbit were MySQL users and there was no PostgreSQL support yet, I instantly knew that this was exactly what my team needed, and we weren't afraid to port it to PostgreSQL. This was around 2010-2012.

While several other gems claimed the ability to spread the DB load to a replica, upon further investigation we discovered that none but Makara was written with multi-threading in mind.

Many people at the time used Unicorn and Resque, both single-threaded, multi-process models (and highly memory-inefficient).

We were already on Puma and Sidekiq, both multi-threaded gems.

Fast forward to literally right now: as part of yet another scaling project at my current work, we are taking advantage of the relatively new native read/write splitting support that became available in Rails 6 and 7.

Having used Makara extensively, and having played with Rails read/write splitting in the last few months, I feel I can make a meaningful comparison.

I know this is not exactly what you asked, but bear with me, because the short answer to your question is 'it depends'.

Rails Read/Write Splitting

This approach has some unfortunate limitations:

  • For each primary you can use no more than one replica (for now)

  • If the replica is on the critical path and something happens to it, the app will likely go down because there is no built-in recovery mechanism.

  • To take advantage of the replica, the least risky method seems to be identifying and moving particularly heavy queries; see the sketch below. Good examples are background jobs that do not need to run urgently: being able to start a job with a slight delay, say two minutes, solves the majority of issues with replication delay and eventual consistency.

After considerable effort to incorporate the new replica into serving traffic, it still handles only around 10% of our select queries (the ones we've migrated so far).
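
For reference, a minimal sketch of the native Rails 6+ mechanism described above; connected_to and the reading role are standard Rails API, while the model and method names are hypothetical:

# database.yml defines the primary plus a single replica entry flagged
# with "replica: true"; Rails can then route individual blocks of queries.
ActiveRecord::Base.connected_to(role: :reading) do
  # Every query inside this block is sent to the replica connection.
  HeavyReport.build_daily_rollup!
end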

Makara

When we first started using Makara at Wanelo in 2011, our Rails site was peaking at around 250K RPM.

Advantages of Makara

  • Makara is much more established and has been in production on sites such as Wanelo and TaskRabbit since 2011; in 2018 I introduced Makara to Homebase (joinhomebase.com) and they never looked back. Obviously, Instacart uses it.

  • I consulted with several other companies that adopted Makara and have since scaled up their traffic by orders of magnitude.

  • Not only did Makara allow companies to spread the read traffic across any number of replicas, but it offered a choice to send some arbitrary portion of the read traffic to the primary as well.

  • If your replicas aren't on identical hardware, you can assign weights to each replica, sending more traffic to the faster/larger replica and less to a smaller one (see the config sketch after this list).

  • In addition to scalability, Makara offers "fault tolerance": automatic blacklisting of disconnected replicas with automatic recovery. If one of the replicas dies, Makara transparently blacklists it for a period of time and stops sending traffic to it, while attempting to reconnect behind the scenes.

  • Makara supports "stickiness", which is a rather complicated concept and I am not entirely sold on its universal applicability.
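
To illustrate the weighted split, here is a database.yml sketch in the shape documented by Makara's README; the adapter, hostnames, and weights are illustrative:

production:
  adapter: postgresql_makara
  makara:
    sticky: true
    connections:
      - role: master
        host: db-primary.internal
      - role: slave
        host: db-replica-big.internal
        weight: 3  # receives roughly 3x the reads of the smaller replica
      - role: slave
        host: db-replica-small.internal
        weight: 1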

So, what is Stickiness?

Stickiness implies a period of time during which a web request thread will "stick to the primary" for all select queries following an important DB write operation. Depending on the duration of stickiness, subsequent web requests from the same user may also be forced onto the primary. The goal is, e.g., that you aren't suddenly unable to "see" the data you just saved.

It was this case (very short stickiness and a small number of very critical queries) that prompted the force_master and retry_on_master helper proposals in this GitHub issue.

Turning on stickiness additionally requires that you carefully choose the stickiness duration based on the traffic, the immediacy requirements of the product, and other constraints.

Alternatives to Stickiness

While cookies may work on the web, background jobs have no such luxury, for obvious reasons.

But background jobs have other neat features that more than compensate for clunky stickiness on the web.

If you use Sidekiq, you have access to several very relevant features:

  1. Sidekiq jobs can be partially or universally delayed by any number of seconds or even minutes. If, instead of including Sidekiq::Worker into each job, you create an intermediate module of your own, e.g. Background::Worker, not only will you use that module to decouple your workers from direct links to Sidekiq, but you can also overwrite perform_async() with perform_in(1.minute, *args) (see the sketch after this list).

For instance, Sidekiq workers can't use stickiness, since they do not support a cookie.

  • But implementing Makara can accomplish horizontal scalability a lot faster, taking advantage of one or more replicas by sending them reads. By configuring the split in database.yml, one can quickly divert 50% of select queries from the primary to the replica. Compare that to the "one query at a time" method of built-in Rails.
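
Here is a minimal sketch of the intermediate-module idea from item 1 above; Background::Worker, the one-minute delay, and the example job are illustrative, not an existing API:

# Hypothetical wrapper that decouples jobs from Sidekiq and adds a delay.
module Background
  module Worker
    def self.included(base)
      base.include(Sidekiq::Worker)
      base.extend(ClassMethods)
    end

    module ClassMethods
      # Enqueue with a small delay so the replica has time to replay
      # any writes made by the request that enqueued the job.
      def perform_async(*args)
        perform_in(1.minute, *args)
      end
    end
  end
end

class RecalculateFeedJob
  include Background::Worker

  def perform(user_id)
    # By the time this runs, the replica has likely caught up.
  end
end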

It should come as no surprise that I personally much prefer Makara, but to be honest, it's been a while since I've used it.

Stickiness and Replication Delay

These two concepts are closely related. If your replicas are able to keep up, with a typical replication delay within fractions of a second, then stickiness may not be needed (or can be extremely short, say 300ms).

Replication delay introduces into the architecture the concept of "eventual consistency".

TL;DR

I am currently pitching my company to experiment with Makara. If I am successful, I'd be more than happy to submit a proper PR with those helpers.

Sorry for the awfully long essay :)

@camol

camol commented Nov 21, 2023

Thank you for this. Really helpful.
Our main intention is to sometimes read from master in order to avoid the replica lag which in some cases might be experienced in our APIs. It's not like we use it everywhere, but it has really helped us in many really strange cases.

@kigster
Author

kigster commented Nov 21, 2023

Keep in mind that you can execute a fast and relatively cheap query against the replica to compute the replication delay.

If I wrote this, I'd run it periodically on a single dedicated thread in that Ruby VM. Then, for queries where timing is critical, you can ask that thread for the latest delay and make your decision accordingly.
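
A minimal sketch of that idea, assuming PostgreSQL; ReplicaRecord is a hypothetical abstract model whose connection points at the replica:

class ReplicationLagMonitor
  # On a PostgreSQL standby this returns the seconds since the last
  # replayed transaction, i.e. an approximation of the replication delay.
  LAG_SQL = "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds"

  attr_reader :lag_seconds

  def initialize(interval_seconds = 5)
    @lag_seconds = 0.0
    @thread = Thread.new do
      loop do
        row = ReplicaRecord.connection.select_one(LAG_SQL)
        @lag_seconds = row["lag_seconds"].to_f
        sleep interval_seconds
      end
    end
  end
end

# Timing-critical code can then consult the latest measurement, e.g.:
#   use_replica = LAG_MONITOR.lag_seconds < 0.3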

@camol
Copy link

camol commented Nov 21, 2023 via email

@kigster
Copy link
Author

kigster commented Nov 21, 2023

@camol Have you considered adding a database-level statement timeout?

Oftentimes when replicas lag, it's because someone is running a long query on a replica that, in order to finish, must push back on WAL transfer and application.
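
For PostgreSQL this can be sketched via the connection variables in database.yml (the five-second value is just an example):

production:
  adapter: postgresql
  variables:
    statement_timeout: 5000  # ms; the server cancels queries running longer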

@kigster
Copy link
Author

kigster commented Nov 21, 2023

The other trick we used was to have a Sidekiq server that was ONLY connected to the master.

The rest (a lot more of them) read only from the replica.

If we enqueued a job that must see the most current data, we used the queue that was attached to the primary.
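
A sketch of that topology; the queue name and job are hypothetical, and each Sidekiq process is booted against a different database config:

# A job that must see the very latest data is routed to the queue served
# by the Sidekiq process whose database config points only at the primary.
class ChargeOrderJob
  include Sidekiq::Worker
  sidekiq_options queue: :primary_only

  def perform(order_id)
    # All queries here hit the primary on that Sidekiq server.
  end
end

# Process layout (e.g. in a Procfile):
#   sidekiq_primary: DATABASE_URL=$PRIMARY_URL bundle exec sidekiq -q primary_only
#   sidekiq_replica: DATABASE_URL=$REPLICA_URL bundle exec sidekiq -q default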

@camol

camol commented Nov 21, 2023 via email

@kigster
Author

kigster commented Nov 28, 2023

This concept — spreading the reads to a potentially lagging replica — exists for a very specific reason.

It's needed when your traffic grows beyond what the largest database instance (1TB of RAM, 256 CPU cores, and a 15-SSD disk array) can handle.

Do NOT use replicas for reads if you need 100% accuracy (like in financial or medical domains).

But absolutely DO use it when you can tolerate a slight delay by queuing your jobs, and when it's not mission critical if a user occasionally sees old data.

Great examples of apps that might use Makara are social apps, content delivery, chat, etc. Apps that are inherently asynchronous can scale 10X compared to using a single primary, by using many replicas with Makara routing the traffic.

The alternative for 100% accuracy is horizontal partitioning (sharding) of the data across multiple masters. This is also the only method that works when your scaling bottleneck is not reads but write IO.

My 2c.

--kig
